[deepspeed] zero inference #14253
Thanks for adding this new mode!
Inference:

1. DeepSpeed ZeRO Inference - same as Training but doesn't require Optimizer
Can we make a real sentence? I don't understand what this means.
@@ -111,6 +111,24 @@ def get_value(self, ds_key_long, default=None):
            return default
        return config.get(ds_key, default)

    def del_config_sub_tree(self, ds_key_long, must_exist=False):
        config = self.config
This method deserves a docstring.
src/transformers/deepspeed.py
Outdated
config = config.get(node)
if config is None:
    if must_exist:
        raise ValueError(f"Can't find {ds_key_long} entry in the config")
Might be worth printing the config as well?
- raise ValueError(f"Can't find {ds_key_long} entry in the config")
+ raise ValueError(f"Can't find {ds_key_long} entry in the config {config}")
that won't work, but I will fix it to dump the config, thank you.
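For readers following this thread, here is a minimal standalone sketch of what a documented sub-tree deletion could look like. It is not the code merged in this PR: it is written as a free function over a plain nested dict rather than as a method, and the dotted-path handling and the error text are assumptions based only on the fragments quoted above. Note that the error message prints the whole top-level config, since the local variable walked down the tree is `None` at the point of failure.

```python
def del_config_sub_tree(config, ds_key_long, must_exist=False):
    """
    Delete the sub-tree stored under a dotted key path, modifying `config` in place.

    Args:
        config: nested dict holding the DeepSpeed configuration.
        ds_key_long: dotted path to the entry, e.g. "zero_optimization.offload_optimizer".
        must_exist: if True, raise ValueError when the entry is missing;
            otherwise a missing entry is silently ignored.
    """
    # Split "a.b.c" into the path to the parent (["a", "b"]) and the leaf key ("c").
    *path, leaf = ds_key_long.split(".")

    # Walk down to the dict that should contain the leaf key.
    parent = config
    for node in path:
        parent = parent.get(node) if isinstance(parent, dict) else None
        if parent is None:
            break

    if isinstance(parent, dict) and leaf in parent:
        del parent[leaf]
        return

    if must_exist:
        # Dump the full config: the intermediate node is None here and would be useless to print.
        raise ValueError(f"Can't find {ds_key_long} entry in the config: {config}")


# Illustrative config; the keys and values are placeholders.
ds_config = {
    "zero_optimization": {"stage": 3, "offload_optimizer": {"device": "cpu"}},
    "optimizer": {"type": "AdamW"},
}
del_config_sub_tree(ds_config, "zero_optimization.offload_optimizer")
del_config_sub_tree(ds_config, "optimizer")
```

With `must_exist=False` (the default in the quoted signature), asking to delete an entry that is already absent is a no-op, which is convenient when a training config is reused for inference.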
hf_deepspeed_config.trainer_config_finalize(args, model, num_training_steps)

# resume config update - some bits like `model` and `num_training_steps` only become available during train
def deepspeed_optim_sched(trainer, hf_deepspeed_config, args, num_training_steps):
This new method deserves a docstring.
Co-authored-by: Sylvain Gugger <[email protected]>
Thanks a lot for the review and the suggestions, Sylvain. All addressed, please have another look at your convenience. Thank you.
This PR extends the HF/DS integration to support DeepSpeed ZeRO inference. We no longer need to waste GPU memory allocating the optimizer and scheduler only to drop them. In some cases this also enables what was not possible before: users who lacked the extra GPU memory and were getting OOM on inference can now run it.
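To illustrate the idea, here is a rough sketch of how an inference config relates to a training one: the same ZeRO settings, minus the optimizer and scheduler sections. The helper name `to_inference_config` and all values are hypothetical and not part of this PR; the top-level `optimizer` and `scheduler` keys follow the DeepSpeed config layout, and the use of stage 3 is an assumption.

```python
# Hypothetical training-style DeepSpeed config; all values are placeholders.
train_config = {
    "zero_optimization": {"stage": 3},  # assumption: ZeRO stage 3
    "optimizer": {"type": "AdamW", "params": {"lr": 3e-5}},
    "scheduler": {"type": "WarmupLR", "params": {"warmup_num_steps": 100}},
    "train_micro_batch_size_per_gpu": 1,
}


def to_inference_config(config):
    """Return a shallow copy of `config` without the optimizer and scheduler sections."""
    return {k: v for k, v in config.items() if k not in ("optimizer", "scheduler")}


inference_config = to_inference_config(train_config)
```

The training config is left untouched, so the same file can serve both modes.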
Blocking events:
@jeffra, @sgugger
The CI errors seem to be unrelated to this PR.